Here is a simpler ToDo list.
See ProxyStorageInterface for details on how we interface between the Proxy server and the RemoteRepository.
The list of MatulyasQuestions and RJs answers has moved.
DiSC is the new name for the DLibrary project.
-- RjHonicky - 10 Mar 2004
Here are some starting points for online documents which might belong in a DiSC repository
- Project Gutenberg - online books
- CiteSeer - computer science articles
- LookSmart (www.findarticles.com) - articles database
- rfc-editor (www.rfc-editor.org) - RFCs
- science-articles (arxiv.org) - e-print ...
- bartleby.com - online literature, reference, nonfiction
-- RjHonicky - 10 Mar 2004
Attach papers here by clicking on the "Attach" link at the bottom
- How do laptops join a network?
- Only cache own requests, but can search the entire repository
- They can also contribute their documents (which they supposedly got while they were surfing elsewhere) towards populating the cache.
- Cache has an advantage over Google in that it can watch what is a popular document instead of inferring by links
- Documents with high hit rate score better?
- How do the graph of links and the flow between them relate to the graph of users and the flow between them?
- ie collaborative filtering
- If search engines return extra metadata about the query (schema info?), how can we use it?
- If a user stays on a page, we can infer that the keywords match the document well (at least for that user...)
- Flow between documents/users could also be used to filter for data subsets
- all (major) known bugs seem to have been resolved!
Data Collection
- Prof Brewer has a trace of web-traffic from the router here at Berkeley: old
- We might want to build a corpus to test the library type of access: do this by playing back traces
Data Interpretation
- Lucene can’t handle anything except plain text. So, if we want to go ahead with the corpus idea, we will need to have specific content handlers.
RJ: handles pdf and html too
Architecture
Proxy to Internet
Data Distribution mechanism
Indexing mechanism
Searching Mechanism
Proxy to user interface
User Interface
Issues
- Proxy to the Internet
- No issues for us. We will be using Smart Cache as Fred suggested. However, it probably does more than we want, so we need to be careful.
Fred, Smart Cache looks fine, but the code is very poorly written, and most comments are in Czech. How wedded to Smart Cache are you?
- Data Distribution Mechanism
- Redundancy: We can postpone this one for now: now no redundancy, it is a cache, not a data source
- Which machine stores the data? The machine with long term storage may be different than the machine that requests this data. hash the url
- Do we hash on keywords? In this case, how do we determine the keywords? Do we use a subset of the index that Lucene generates for the document? no, hash on url
- Does each machine cache independently?: There is no overlap in the caches: the set of documents cached by each server is disjoint.
- Indexing mechanism
- Searching mechanism
- We use Lucene’s search for searching on an individual machine
- Proxy to the user interface
- Does it contact the indexing interface or does it contact the searching mechanism? both
- Depending upon whether we hash keywords or allow every machine to cache what it wants or something else, this proxy would need to one, all or a certain number of the machines part of this infrastructure. This sentence doesn’t make sense, but I assume you are asking whether each machine has its own proxy: yes
- How do we merge at the end machine? Do we get an equal number of hits from each cache, do we get all the hits, or do we get more hits from a cache that has more relevant data and fewer from one that has less relevant data?: Lucene handles this
- User Interface
- I would say that this is not a big issue, at least for the meanwhile. Indeed, however it must be done. How much time?
- Need to integrate with a web-browser? no
- Should differentiate between offline and online content? This begs the question: how do we know what the online content is supposed to be? Do we cache results of online searches (and some documents that we fetched as opposed to not caching at all or caching all the documents that the online search returned)?
Fred is doing this right now
- For the hash algorithm, we should use the one I just published: I’ll send it
- For cache replacement, LRU
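The two decisions above (hash on the url so that each server caches a disjoint set of documents, and LRU for replacement) can be sketched as follows. This is only an illustration under assumptions: the hash algorithm mentioned above is not given on this page, so MD5 reduced modulo the number of servers stands in for it, and the names UrlPartitioner and LruCache are hypothetical rather than classes from the dlibrary or RabbIT2 modules.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: maps a url to the index of the one server that caches it. */
class UrlPartitioner {
    /** Returns an index in [0, serverCount) derived from an MD5 digest of the url (stand-in hash). */
    static int serverFor(String url, int serverCount) {
        if (serverCount <= 0) {
            throw new IllegalArgumentException("serverCount must be positive");
        }
        try {
            byte[] d = MessageDigest.getInstance("MD5").digest(url.getBytes(StandardCharsets.UTF_8));
            // Fold the first four digest bytes into an int, then reduce modulo the server count.
            int h = ((d[0] & 0xff) << 24) | ((d[1] & 0xff) << 16) | ((d[2] & 0xff) << 8) | (d[3] & 0xff);
            return Math.floorMod(h, serverCount);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 digest not available", e);
        }
    }

    public static void main(String[] args) {
        String url = "http://www.rfc-editor.org/";
        System.out.println(url + " -> server " + serverFor(url, 4));
    }
}

/** Hypothetical LRU map: evicts the least recently accessed entry once capacity is exceeded. */
class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    LruCache(int capacity) {
        super(16, 0.75f, true); // true = iterate in access order, which is what LRU needs
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity;
    }
}
```

Because every url maps to exactly one index, no two servers hold the same document. A plain modulo reshuffles most urls whenever the number of servers changes, so it is probably only suitable for a fixed set of machines.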
-- RjHonicky - 18 Nov 2003
We are currently designing experiments which use DLibrary. Matulya wrote a quick Word document outline, which I have ported into CVS in the documentation module. Matulya's version is here for reference.
RunningDiSC describes how to check out and run everything.
-- RjHonicky - 02 Dec 2003
The proxy server (RabbIT) currently stores metadata and data separately: there is a single file which contains cache data, plus the header of each request which resulted in the object being cached. This file is managed by the NCache and NCacheEntry objects. This is great, since it provides a single class through which all the metadata passes.
Data, however, is managed by several classes since it is streamed, rather than read and then written (a good design for a proxy cache, but which complicates things for us). The relevant classes are
- handler.BaseHandler
- proxy.PartialCacher
- ...
Both interfaces (Lucene and RabbIT) want to deal with streams, so I have defined a set of classes that allow you to pass references to streams across process boundaries. In RemoteIO, I define a detailed interface for remote streams. The code is in CVS under the package remoteio.
With remote streams, the interface to the RemoteRepository works as follows
- Somebody adds a cache entry, based on the HTTP header. I think this is BaseHandler.
- Adding an NCacheEntry to the NCache causes the cache to send an RMI call to the RemoteRepository, which creates a pending entry and returns a RemoteOutputStream.
- The RemoteOutputStream is added to the NCacheEntry.
- Next, BaseHandler adds the stream from the cache to the MultiStream that it maintains to multiplex the data from the HTTP response to the client and a file. The RemoteOutputStream replaces that file.
- Finally, when the stream is closed, the RemoteRepository is notified, and the new cache data is indexed.
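The multiplexing and close notification in the last two steps can be sketched roughly as below. This is not RabbIT's MultiStream or the real RemoteOutputStream: TeeOutputStream is a hypothetical stand-in that copies each write to two sinks, and the Runnable passed in plays the role of "notify the RemoteRepository so it can index the data".

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

/** Hypothetical stand-in for the multiplexing step: every write goes to the client and the repository. */
class TeeOutputStream extends OutputStream {
    private final OutputStream client;
    private final OutputStream repository;
    private final Runnable onClose;
    private boolean closed;

    TeeOutputStream(OutputStream client, OutputStream repository, Runnable onClose) {
        this.client = client;
        this.repository = repository;
        this.onClose = onClose;
    }

    @Override
    public void write(int b) throws IOException {
        client.write(b);
        repository.write(b);
    }

    @Override
    public void write(byte[] buf, int off, int len) throws IOException {
        client.write(buf, off, len);
        repository.write(buf, off, len);
    }

    @Override
    public void flush() throws IOException {
        client.flush();
        repository.flush();
    }

    @Override
    public void close() throws IOException {
        if (closed) {
            return; // closing twice must not notify twice
        }
        closed = true;
        try {
            client.close();
        } finally {
            try {
                repository.close();
            } finally {
                onClose.run(); // stands in for "the repository is notified and indexes the data"
            }
        }
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream client = new ByteArrayOutputStream();
        ByteArrayOutputStream repo = new ByteArrayOutputStream();
        try (TeeOutputStream tee = new TeeOutputStream(client, repo,
                () -> System.out.println("index " + repo.size() + " bytes"))) {
            tee.write("<html>cached page</html>".getBytes(StandardCharsets.US_ASCII));
        }
    }
}
```

In the real system the repository side would be a remote stream, so the data is streamed to the repository while the client is still receiving it, instead of being read completely and then written.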
In the RankingExperiment, the cache is first primed using a trace, or perhaps simply from usage over a period. A set of users are then given a list of search queries to perform. These queries can be of two types
- Do the following exact query (e.g. "president bush incompetent")
- Run a query to find pages discussing something (e.g. the incompetence of "president" bush)
The second type of query allows the user to select the semantic meaning of the query, instead of forcing them to interpret or guess the query's meaning. If the semantic meaning of the query were supplied with the exact query, the results could be skewed by an inexact match between the query and the meaning provided. Both types of queries are therefore supplied, and analyzed separately.
The query is then performed against both the cache and Google.com, retrieving up to some maximum number of documents each. These results are randomly mixed and merged, and displayed for the user. The url, title, and summary of the document are not displayed, so that the user is not biased by the quality of the summary, title extraction, url cleanup etc, but rather only considers the ranking of the documents.
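The mix-and-merge step might look something like the sketch below. ResultMixer is a hypothetical name, results are represented only by their urls, and collapsing a document returned by both sources into a single entry is an assumption: the page does not say how duplicates are handled.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;

/** Hypothetical sketch of merging cache and Google results into one randomly ordered list. */
class ResultMixer {
    /**
     * Takes up to maxPerSource urls from each list, drops duplicates (an assumption),
     * and shuffles the merged list so the user cannot tell which source ranked what.
     */
    static List<String> mix(List<String> cacheHits, List<String> googleHits, int maxPerSource, Random rng) {
        Set<String> merged = new LinkedHashSet<>();
        merged.addAll(cacheHits.subList(0, Math.min(maxPerSource, cacheHits.size())));
        merged.addAll(googleHits.subList(0, Math.min(maxPerSource, googleHits.size())));
        List<String> shuffled = new ArrayList<>(merged);
        Collections.shuffle(shuffled, rng);
        return shuffled;
    }

    public static void main(String[] args) {
        List<String> cache = List.of("http://a.example/1", "http://b.example/2");
        List<String> google = List.of("http://b.example/2", "http://c.example/3");
        System.out.println(mix(cache, google, 10, new Random()));
    }
}
```

Passing the Random in makes a run repeatable with a fixed seed, which may be handy when the same mixed list has to be shown to several users.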
The user is asked to choose the top five documents in the result set. The user browses all of the results and ranks the top five.
For each of the top five documents chosen by the user, if the document is one of Google's top five, Google gets a point. If the document is one of the cache's top five, the cache gets a point.
Google's score is then subtracted from the cache's score. A mean and standard deviation are calculated for the difference, over a large set of queries. By gathering rankings on the same query from several users, we can also calculate a standard error for our results.
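The scoring arithmetic can be sketched as below. RankingScore is a hypothetical name and documents are identified by url. The page does not say which standard deviation is meant, so the sample form (dividing by n - 1) and the usual standard error (standard deviation over the square root of n) are assumptions.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch of the cache-versus-Google scoring described above. */
class RankingScore {
    /** One point per user pick found in a source's top five; returns cache score minus Google score. */
    static int difference(List<String> userTop, List<String> cacheTop, List<String> googleTop) {
        Set<String> cache = new HashSet<>(cacheTop);
        Set<String> google = new HashSet<>(googleTop);
        int cacheScore = 0;
        int googleScore = 0;
        for (String doc : userTop) {
            if (cache.contains(doc)) cacheScore++;
            if (google.contains(doc)) googleScore++;
        }
        return cacheScore - googleScore;
    }

    static double mean(int[] diffs) {
        double sum = 0;
        for (int d : diffs) sum += d;
        return sum / diffs.length;
    }

    /** Sample standard deviation (n - 1 in the denominator); needs at least two values. */
    static double sampleStdDev(int[] diffs) {
        double m = mean(diffs);
        double squares = 0;
        for (int d : diffs) squares += (d - m) * (d - m);
        return Math.sqrt(squares / (diffs.length - 1));
    }

    /** Standard error of the mean: standard deviation divided by sqrt(n). */
    static double standardError(int[] diffs) {
        return sampleStdDev(diffs) / Math.sqrt(diffs.length);
    }

    public static void main(String[] args) {
        int[] diffs = {2, 0, -1, 3}; // made-up differences, one per query
        System.out.println("mean=" + mean(diffs) + " sd=" + sampleStdDev(diffs)
                + " se=" + standardError(diffs));
    }
}
```

A positive mean difference would suggest that users prefer the cache's ranking, a negative one that they prefer Google's.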
This experiment can be repeated under different conditions:
- larger/smaller cache size
- more/fewer users (generating requests, not users ranking search queries)
- different traces
- different user groups
- users in the same research group
- users in the same institution
- users in the same library
RemoteOutputStreamProxy, RemoteInputStreamProxy
----------------------- ----------------------
+ RemoteOutputStreamProxy(Integer streamKey,
URL serverUrl):RemoteOutputStreamProxy
creates a Serializable proxy object which can be passed back
from a remote call
- serverURL:URL
- streamKey:Integer
- stream:RemoteOutputStreamServer
if serialize must be implemented, then it will not transfer the
stream object: it will be null when deserialized, so that the
client is forced to reopen the server
o implements the OutputStreamInterface
o caches writes to do them a block at a time
o flush will first write and then call flush remotely
o all operations open the stream on demand
RemoteOutputStreamServer
------------------------
- streamMap: HashMap<OutputStreams>
= addStream(OutputStream stream):RemoteOutputStreamProxy
+ write(Integer streamKey, toWrite:byte[]):void
writes toWrite.length bytes to the remote stream
+ flush(Integer streamKey):void
flushes the remote stream
+ close(Integer streamKey):void
closes the remote stream and removes it from streamMap
RemoteInputStreamServer
-----------------------
- streamMap: HashMap<elements are InputStreams>
= addStream(InputStream stream):RemoteInputStreamProxy
+ read(Integer streamKey, int maxLength):byte[]
reads up to maxLength bytes from the remote stream
+ close(Integer streamKey): void
closes the remote stream
+ mark(Integer streamKey, int readLimit): void
Marks the current position in the remote stream.
+ markSupported(Integer streamKey):boolean
Tests if this input stream supports the mark and reset methods.
+ reset(Integer streamKey): void
Repositions this stream to the position at the time the mark method
was last called on this input stream.
+ skip(Integer streamKey, long n):long
Skips over and discards n bytes of data from this input stream.
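The output half of this interface can be approximated in a single process as shown below. This is a sketch, not the code in the remoteio package: there is no RMI here, the proxy holds a direct reference to the server instead of a serverUrl, and addStream returns the bare key rather than a proxy object. StreamServer and BufferingStreamProxy are hypothetical names.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

/** In-process stand-in for the server side: streams are addressed by an integer key. */
class StreamServer {
    private final Map<Integer, OutputStream> streamMap = new HashMap<>();
    private int nextKey = 0;

    synchronized int addStream(OutputStream stream) {
        int key = nextKey++;
        streamMap.put(key, stream);
        return key;
    }

    synchronized void write(int streamKey, byte[] toWrite) throws IOException {
        lookup(streamKey).write(toWrite);
    }

    synchronized void flush(int streamKey) throws IOException {
        lookup(streamKey).flush();
    }

    /** Closes the stream and removes it from streamMap. */
    synchronized void close(int streamKey) throws IOException {
        OutputStream stream = streamMap.remove(streamKey);
        if (stream == null) throw new IOException("unknown stream key " + streamKey);
        stream.close();
    }

    private OutputStream lookup(int streamKey) throws IOException {
        OutputStream stream = streamMap.get(streamKey);
        if (stream == null) throw new IOException("unknown stream key " + streamKey);
        return stream;
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        StreamServer server = new StreamServer();
        try (OutputStream out = new BufferingStreamProxy(server, server.addStream(sink), 4)) {
            out.write("hello world".getBytes(StandardCharsets.US_ASCII));
        }
        System.out.println(sink.toString("US-ASCII"));
    }
}

/** Client-side stream: caches writes and forwards them a block at a time. */
class BufferingStreamProxy extends OutputStream {
    private final StreamServer server;
    private final int streamKey;
    private final byte[] block;
    private int used;
    private boolean closed;

    BufferingStreamProxy(StreamServer server, int streamKey, int blockSize) {
        if (blockSize <= 0) throw new IllegalArgumentException("blockSize must be positive");
        this.server = server;
        this.streamKey = streamKey;
        this.block = new byte[blockSize];
    }

    @Override
    public void write(int b) throws IOException {
        if (used == block.length) sendBlock();
        block[used++] = (byte) b;
    }

    /** Flush first writes the buffered bytes, then flushes the server-side stream. */
    @Override
    public void flush() throws IOException {
        sendBlock();
        server.flush(streamKey);
    }

    @Override
    public void close() throws IOException {
        if (closed) return;
        closed = true;
        sendBlock();
        server.close(streamKey);
    }

    private void sendBlock() throws IOException {
        if (used == 0) return;
        server.write(streamKey, Arrays.copyOf(block, used));
        used = 0;
    }
}
```

Sending a block at a time matters once the calls really are remote, since each write would otherwise cost a full round trip.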
-- RjHonicky - 18 Nov 2003
This is a page which describes for developers how to get things up and running.
First check out everything:
export CVS_RSH=ssh
cvs -z3 -d:ext:developername@cvs.sourceforge.net:/cvsroot/dlibrary co dlibrary
cvs -z3 -d:ext:developername@cvs.sourceforge.net:/cvsroot/dlibrary co RabbIT2
cvs -z3 -d:ext:developername@cvs.sourceforge.net:/cvsroot/dlibrary co documentation
Next, set up your shell:
cd dlibrary
. setClasspath.sh
for sh, bash, etc., or
source setClasspath.csh
for csh, tcsh, etc.
Now build everything:
make
Note: you must have make installed, and this probably only works on Unix-like systems.
cd ../RabbIT2
./jmake
Next make a directory for the repository:
mkdir /var/tmp/repository
ln -s /var/tmp/repository/rabbit.conf conf/rabbit.conf
Now you're ready to start up DiSC. Make sure that you are still in the RabbIT2 directory, and do the following:
java dlibrary.RemoteRepositoryImpl /var/tmp/repository localhost 2345&
java rabbit.proxy.Proxy
You will get some debugging output on the screen for each page you download.
-- RjHonicky - 03 May 2004
The search interface can be accessed by going to the following url:
http://localhost:9666/FileSender/search/search_cache.html
You must log in to use this interface: add yourself to the conf/users file (the format is username:password), or you can use the default user RabbIT:RabbIT.
The search interface currently returns a linked title, plus a summary. The summary that the HTML parser currently generates is crap: it just gives the first 100 or so tokens in the document. A better summarizer would be very handy, or even better, the context of the hit, like Google does.
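To make the difference concrete, here is a rough sketch of both kinds of summary. Summarizer is a hypothetical class, not the parser in the dlibrary module: tokens are simply whitespace-separated words, and a "hit" is the first token containing the query term, ignoring case.

```java
import java.util.Arrays;
import java.util.Locale;

/** Hypothetical sketch: leading-token summary versus a summary built around the hit. */
class Summarizer {
    /** Roughly the current behaviour: the first maxTokens whitespace-separated tokens. */
    static String leadingTokens(String text, int maxTokens) {
        String[] tokens = text.trim().split("\\s+");
        int n = Math.min(maxTokens, tokens.length);
        return String.join(" ", Arrays.copyOfRange(tokens, 0, n));
    }

    /** Returns `window` tokens on each side of the first token containing the term. */
    static String hitContext(String text, String term, int window) {
        String[] tokens = text.trim().split("\\s+");
        String needle = term.toLowerCase(Locale.ROOT);
        for (int i = 0; i < tokens.length; i++) {
            if (tokens[i].toLowerCase(Locale.ROOT).contains(needle)) {
                int from = Math.max(0, i - window);
                int to = Math.min(tokens.length, i + window + 1);
                String snippet = String.join(" ", Arrays.copyOfRange(tokens, from, to));
                // Mark truncation on either side with an ellipsis.
                return (from > 0 ? "... " : "") + snippet + (to < tokens.length ? " ..." : "");
            }
        }
        // No hit in the text: fall back to the leading tokens.
        return leadingTokens(text, 2 * window + 1);
    }

    public static void main(String[] args) {
        String text = "the proxy stores cached pages in a remote repository for later search";
        System.out.println(leadingTokens(text, 3));
        System.out.println(hitContext(text, "remote", 2));
    }
}
```

A real version would probably take the hit positions from the search engine rather than rescanning the document text.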
I have also added a few pages to perform the RankingExperiment. The search page is at
http://localhost:9666/FileSender/search/search_both.html
See also: KnownBugs, InterestingIdeas
- Add in support for PDF via plugins provided by Lucene (there is an interface in RemoteRepositoryImpl for adding content types)
- Finish the search interface
- currently only spits back useless relative file paths: should give back clickable urls
- Should also spit back a summary and title for each search result
- Find the traces
- write a parser for the traces
- interface the parser with an HTTP client
- write a program that uses the parser to drive the HTTP client through the proxy
- add batch mode to the RemoteRepositoryImpl to speed things up for the trace driven experiments
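The trace items above could start from something like the sketch below. The trace format is not described on this page, so one url per line (with blank lines and '#' comments skipped) is an assumption, as is the proxy listening on localhost:9666, the port used in the search interface urls. TraceReplayer is a hypothetical name.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: replay a list of urls through the proxy to prime the cache. */
class TraceReplayer {
    /** Assumed trace format: one url per line; blank lines and lines starting with '#' are ignored. */
    static List<String> parseTrace(List<String> lines) {
        List<String> urls = new ArrayList<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) continue;
            urls.add(trimmed);
        }
        return urls;
    }

    /** Fetches the url through an HTTP proxy, discards the body, and returns the status code. */
    static int fetchThroughProxy(String url, String proxyHost, int proxyPort) throws IOException {
        Proxy proxy = new Proxy(Proxy.Type.HTTP, new InetSocketAddress(proxyHost, proxyPort));
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection(proxy);
        conn.setConnectTimeout(5000);
        conn.setReadTimeout(5000);
        try {
            int status = conn.getResponseCode();
            try (InputStream in = status < 400 ? conn.getInputStream() : conn.getErrorStream()) {
                if (in != null) {
                    byte[] buf = new byte[8192];
                    while (in.read(buf) != -1) {
                        // Drain the body so the proxy sees the complete response.
                    }
                }
            }
            return status;
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 1) {
            System.err.println("usage: java TraceReplayer <trace-file>");
            return;
        }
        for (String url : parseTrace(Files.readAllLines(Paths.get(args[0])))) {
            try {
                // localhost:9666 is an assumption; adjust to wherever the proxy listens.
                System.out.println(fetchThroughProxy(url, "localhost", 9666) + " " + url);
            } catch (IOException e) {
                System.out.println("failed " + url + ": " + e.getMessage());
            }
        }
    }
}
```

This fetches one url at a time; placing real load on the proxy would need several threads running the same loop.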
| Topics in DLibrary web: | Changed: now 13:16 GMT | Changed by: |
| RunningDiSC | 03 May 2004 - 01:18 - NEW | RjHonicky |
| This is a page which describes for developers how to get things up and running. First check out everything: export CVS RSH ssh cvs z3 d:ext:developername@cvs.sourceforge ... |
| ProjectDocs | 03 May 2004 - 01:00 - r1.5 | RjHonicky |
| We are currently designing experiments which utilize DLibrary. Matulya wrote a quick word document outline, which I have ported into cvs in the documentation module ... |
| DocumentSources | 10 Mar 2004 - 01:23 - NEW | RjHonicky |
| Here are some starting points for online documents which might belong in a DiSC repository project guttenberg online books citeseer computer science articles looksmart ... |
| WebHome | 10 Mar 2004 - 01:20 - r1.17 | RjHonicky |
| Folks, if you've never used a wiki, learn about wiki webs on the TWiki.GoodStyle page. News: Main.RjHonicky 08 Dec 2003 : Fixed a locking problem (see KnownBugs ... |
| InterestingIdeas | 05 Mar 2004 - 14:09 - r1.5 | RjHonicky |
| How do laptops join a network? Only cache own requests, but can search the entire repository They can also contribute their documents (which they supposedly got while ... |
| ProjectLinks | 14 Dec 2003 - 02:18 - r1.7 | RjHonicky |
| The sourceforge project page http://sourceforge.net/projects/dlibrary/ RabbIT: a transcoding proxy http://www.khelekore.org/rabbit/readme.shtml Jakarta Lucene, an ... |
| SearchInterface | 07 Dec 2003 - 00:49 - NEW | RjHonicky |
| The search interface can be accessed by going to the following url: http://localhost:9666/FileSender/search/search cache.html You must log in to use this interface ... |
| ToDo | 01 Dec 2003 - 06:38 - r1.4 | RjHonicky |
| See also: KnownBugs, InterestingIdeas Add in support for PDF via plugins provided by lucence (there is an interface in RemoteRepositoryImpl for adding content types ... |
| RemoteIO | 18 Nov 2003 - 18:12 - NEW | RjHonicky |
| RemoteOutputStreamProxy, RemoteInputStreamProxy RemoteOutputStreamProxy(Integer streamKey, URL serverUrl):RemoteOutputStreamProxy creates a Serlializable proxy object ... |
| DLibraryDesign | 18 Nov 2003 - 18:09 - NEW | RjHonicky |
| Here is a simpler ToDo list. See ProxyStorageInterface for details on how we interface between the Proxy server and the RemoteRepository. The list of MatulyasQuestions ... |
| WebPreferences | 18 Nov 2003 - 08:00 - r1.2 | RjHonicky |
| TWiki.DLibrary Web Preferences The following settings are web preferences of the TWiki.DLibrary web. These preferences overwrite the site-level preferences in TWIKIWEB ... |
| TWikiGuest | 18 Nov 2003 - 07:35 - NEW | TWikiGuest |
| A guest of this TWiki web, not unlike yourself. You can leave your trace behind you, just add your name in TWIKIWEB .TWikiRegistration and create your own page. Personal ... |
| TWikiUsers | 14 Nov 2003 - 01:33 - r1.16 | RjHonicky |
| List of TWiki users Please take the time and add yourself to the list. To do that fill out the form in TWIKIWEB .TWikiRegistration. This will create an account for ... |
Number of topics: 22
Folks, if you've never used a wiki, learn about wiki webs on the GoodStyle page.
News:
- -- RjHonicky - 08 Dec 2003 : Fixed a locking problem (see KnownBugs), and added a multi-threaded script to read in urls and place load on the proxy
- -- RjHonicky - 07 Dec 2003 : Checked in a new SearchInterface: more information for results, and also a page to do the RankingExperiment
- -- RjHonicky - 06 Dec 2003 : Fixed a bug with locking (see KnownBugs) and also finished the search interface. Fred, you need to check out these changes and see how they affect your support for PDF
- -- RjHonicky - 01 Dec 2003 : Fixed a problem with indexes getting overwritten when new files are added. Also, files are removed from the index when they are removed from the cache.
- -- RjHonicky - 30 Nov 2003 : Fixed proxy-to-repository integration problem, so that files now actually load from the cache when they are cached. Check out both dlibrary and RabbIT2 to see the changes. I changed the repository root to /tmp/repository so make sure you move the symlink to rabbit.conf to /tmp/repository
- -- RjHonicky - 22 Nov 2003 : Fixed proxy configuration to not gzip files
- I think I also eliminated image transcoding, so that needs to be re-added, and files should be gzipped in storage eventually too
- -- RjHonicky - 22 Nov 2003 : Added search interface to RabbIT:
- -- RjHonicky - 21 Nov 2003 : Added parsing support for mime type text/html
- -- RjHonicky - 21 Nov 2003 : I just checked in a version of the code which stores data and metadata remotely. Please checkout the following modules
Starting Points:
- Check out the ToDo, KnownBugs and InterestingIdeas pages for pending tasks
- On the ProjectLinks page, put relevant links.
- On the DocumentSources page, put good sources of documents for bulk upload into DiSC
- On the DownloadedPapers page, upload papers that you get the PDF or Postscript for
- The DLibraryDesign page has issues related to the design, and ProjectDocs has other interesting information and writeups
- Put announcements on this page.
- Feel free to change things as you see fit: everything is versioned anyway, so if you mess up, no big deal
- search-traces.txt: search queries
Copyright © 1999-2003 by the contributing authors.
All material on this collaboration platform is the property of the contributing authors.